1  Simple Regressions

Based on Wooldridge (2019), Chapters 1 and 2

1.1 Models and Data

What is econometrics?

  • Econometrics = use of statistical methods to analyze economic data

    • Econometric methods are used in many other fields, such as the social sciences, medicine, etc.
  • Econometricians typically analyze nonexperimental data

Typical goals of econometric analysis

  • Estimating relationships between economic variables

  • Testing economic theories and hypotheses

  • Forecasting economic variables

  • Evaluating and implementing government and business policy

Steps in econometric analysis

  1. Economic model (this step is often skipped)

  2. Econometric model


1.1.1 Economic models

  • Micro- or macromodels, growth models, models of open economies, etc.

  • Often use optimizing behavior, equilibrium modeling, …

  • Establish relationships between economic variables

  • Examples: demand equations, pricing equations, Euler equations …

Economic model of crime (Becker (1968))

An equation for criminal activity is derived from utility maximization, which results in

y = f(x_1, x_2, \ldots , x_k)

  • Dependent variable

    • y = Hours spent in criminal activities
  • Explanatory variables x_j

    • “Wage” of criminal activities
    • Wage for legal employment
    • Other income
    • Probability of getting caught
    • Probability of conviction if caught
    • Expected sentence
    • Family background
    • Talent for crime, moral character
  • The functional form of the relationship is not specified

  • The equation above could have been postulated without economic modeling

    • But in this case, the model lacks a theoretical foundation
      • If we have a theoretical model, we can often derive the expected sign of the coefficients or even guess the magnitude
      • This can be compared to the estimated coefficients, and if the expectations are not met, we can search for a rationale

Economic Model of job training and worker productivity

  • What is effect of additional training on worker productivity?

  • Formal economic theory is not really needed to derive this equation, although a formal derivation is clearly possible:

wage = f(educ, exper, \ldots , training)

  • Dependent variable

    • wage = hourly wage
  • Explanatory variables x_j

    • educ = years of formal education
    • exper = years of work force experience
    • training = weeks spent in job training
  • Other factors may be relevant as well, but these are the most important (?)


1.1.2 Econometric models

Econometric model of criminal activity

  • The functional form has to be specified

  • Variables may have to be approximated by other quantities (leading to measurement errors)

crime = \beta_{0} + \beta_{1} { wage } + \beta_{2} { othinc } + \beta_{3} { freqarr } + \beta_{4} { freqconv } + \\ \beta_{5} { avgsen } + \beta_{6} { age } + u

  • crime … measure of criminal activity
  • wage … wage for legal employment
  • othinc … other income
  • freqarr … frequency of prior arrests
  • freqconv … frequency of convictions
  • avgsen … Average sentence length after conviction
  • age … age
  • u … error term, which contains unobserved factors (lack of data), like moral character, wage in criminal activity, family background, etc. Oddly enough, it is this error term that attracts the most attention in econometrics

Econometric model of job training and worker productivity

wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u

  • wage … hourly wage

  • educ … years in formal education

  • exper … years of workforce experience

  • training … weeks spent in job training

  • u … error term representing unobserved determinants of the wage like innate ability, quality of education, family background


  • As mentioned above, most of econometrics deals with the specification of the error u. As we will see, this is essential for a causal interpretation of the estimates

  • Econometric models may also be used for hypothesis testing

    • For example, the parameter \beta_3 represents the effect of training on wages
      • How large is this effect? Is it even different from zero?

1.1.3 Data

  • Econometric analysis requires data and there are different kinds of economic data sets

    • Cross-sectional data

    • Time series data

    • Pooled cross sections

    • Panel/Longitudinal data

  • Econometric methods depend on the nature of the data used

    • Different data sets lead to different estimation problems. Use of inappropriate methods may lead to misleading results
  • Cross-sectional data sets

    • Sample of individuals, households, firms, cities, states, countries or other units of interest at a given point of time/in a given period

    • Cross-sectional observations are more or less independent

    • For example, pure random sampling from a population

    • Sometimes pure random sampling is violated, e.g., units refuse to respond in surveys, or if sampling is characterized by clustering (this usually leads to autocorrelation, heteroscedasticity or sample selection problems)

    • Cross-sectional data are typically encountered in applied microeconomics


# Cross-sectional data set on wages and other characteristics. Look especially at indicator variables
library(wooldridge)
data(wage1) 

head(wage1, 10)
          wage educ exper tenure nonwhite female married numdep smsa northcen south
      1   3.10   11     2      0        0      1       0      2    1        0     0
      2   3.24   12    22      2        0      1       1      3    1        0     0
      3   3.00   11     2      0        0      0       0      2    0        0     0
      4   6.00    8    44     28        0      0       1      0    1        0     0
      5   5.30   12     7      2        0      0       1      1    0        0     0
      6   8.75   16     9      8        0      0       1      0    1        0     0
      7  11.25   18    15      7        0      0       0      0    1        0     0
      8   5.00   12     5      3        0      1       0      0    1        0     0
      9   3.60   12    26      4        0      1       0      2    1        0     0
      10 18.18   17    22     21        0      0       1      0    1        0     0
         west construc ndurman trcommpu trade services profserv profocc clerocc
      1     1        0       0        0     0        0        0       0       0
      2     1        0       0        0     0        1        0       0       0
      3     1        0       0        0     1        0        0       0       0
      4     1        0       0        0     0        0        0       0       1
      5     1        0       0        0     0        0        0       0       0
      6     1        0       0        0     0        0        1       1       0
      7     1        0       0        0     1        0        0       1       0
      8     1        0       0        0     0        0        0       1       0
      9     1        0       0        0     1        0        0       1       0
      10    1        0       0        0     0        0        0       1       0
         servocc    lwage expersq tenursq
      1        0 1.131402       4       0
      2        1 1.175573     484       4
      3        0 1.098612       4       0
      4        0 1.791759    1936     784
      5        0 1.667707      49       4
      6        0 2.169054      81      64
      7        0 2.420368     225      49
      8        0 1.609438      25       9
      9        0 1.280934     676      16
      10       0 2.900322     484     441
# or
library(gt) # for pretty html-table plots

gt(head(wage1,10))
(The gt() call renders the same first ten rows of wage1, shown above, as a formatted HTML table.)

  • Time series data

    • Observations of a variable or several variables over time

    • For example, stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales, …

    • Time series observations are typically serially correlated

    • Ordering of observations conveys important information

    • Data frequency: daily, weekly, monthly, quarterly, annually, high frequency data

    • Typical features of time series: trends and seasonality

    • Typical applications: applied macroeconomics and finance


# Time series data on minimum wages and related variables for Puerto Rico
library(gt) # for pretty html-table plots
library(wooldridge)
data(prminwge)

gt( prminwge[1:20, c("year", "avgmin", "avgcov", "prunemp", "prgnp")] )
year avgmin avgcov prunemp prgnp
1950 0.198 0.201 15.4 878.7
1951 0.209 0.207 16.0 925.0
1952 0.225 0.226 14.8 1015.9
1953 0.311 0.231 14.5 1081.3
1954 0.313 0.224 15.3 1104.4
1955 0.369 0.236 13.2 1138.5
1956 0.447 0.245 13.3 1185.1
1957 0.488 0.244 12.8 1221.8
1958 0.555 0.238 14.2 1258.4
1959 0.588 0.260 13.3 1363.6
1960 0.616 0.270 11.8 1473.2
1961 0.608 0.269 12.7 1562.8
1962 0.707 0.279 12.8 1683.9
1963 0.723 0.279 11.0 1820.7
1964 0.809 0.294 11.2 1916.8
1965 0.834 0.302 11.7 2083.0
1966 0.854 0.444 12.3 2223.2
1967 0.971 0.448 11.6 2328.4
1968 1.104 0.455 10.3 2455.3
1969 1.149 0.455 10.3 2684.0
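
To see the typical trending behavior of such a series, we can plot the average minimum wage over time. This is a minimal sketch reusing the prminwge data loaded above.
# Plotting the average minimum wage over time to illustrate the trend
plot(avgmin ~ year, data = prminwge, type = "l",
     xlab = "Year", ylab = "Average minimum wage")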

  • Pooled cross sections

    • Two or more cross sections are combined in one data set

    • Cross sections are drawn independently of each other

    • Pooled cross sections often used to evaluate policy changes

  • Example:

    • Evaluate effect of change in property taxes on house prices

      • Random sample of house prices for the year 1993

      • A new random sample of house prices for the year 1995

      • Compare before/after (1993: before reform, 1995: after reform)
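
In R, a pooled cross section looks like an ordinary cross-sectional data set with an additional year indicator. The following is an illustrative sketch, assuming the kielmc house price data from the wooldridge package (which pools the years 1978 and 1981, not the hypothetical 1993/1995 tax example above).
# Pooled cross section: house prices sampled independently in two years
library(wooldridge)
data(kielmc)

# number of observations in each of the two cross sections
table(kielmc$year)

# average real house price in each year
tapply(kielmc$rprice, kielmc$year, mean)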


  • Panel or longitudinal data

    • The same cross-sectional units are followed over time. Therefore, wide panels are basically pooled cross sections with the very same units (which are many)

    • Long panels are time series for several units (e.g., countries or counties)

    • Panel data have a cross-sectional and a time series dimension. So we have two id-variables

    • Panel data can be used to account for time-invariant unobservable factors

    • Panel data can also be used to model lagged responses

  • Example:

    • City crime statistics; each city is observed for several years

      • Time-invariant unobserved city characteristics may be modeled

      • Effect of police on crime rates may exhibit time lag


# Panel data set on city crime statistics
library(wooldridge)
data(countymurders)

gt( countymurders[ (countymurders$year >= 1990 & countymurders$countyid <= 1005), 
                      c("countyid", "year", "murders", "popul", "percblack", 
                        "percmale", "rpcpersinc")] )
countyid year murders popul percblack percmale rpcpersinc
1001 1990 1 34512 20.19000 40.46000 10975.24
1001 1991 1 35024 20.27000 40.48000 11152.39
1001 1992 1 35560 20.34000 40.51000 11263.97
1001 1993 1 37027 20.48505 48.68339 11312.82
1001 1994 1 38027 20.64849 48.71013 11541.15
1001 1995 5 38957 20.87686 48.72552 11680.74
1001 1996 7 40061 20.97551 48.70073 11852.76
1003 1990 7 99200 13.01000 41.30000 11600.30
1003 1991 3 102224 13.04000 41.37000 11854.09
1003 1992 5 105344 13.07000 41.43000 12124.56
1003 1993 7 111018 13.17624 48.69210 12645.61
1003 1994 5 115266 13.28579 48.73163 13012.65
1003 1995 13 119373 13.42347 48.82176 13327.95
1003 1996 6 123023 13.49666 48.83233 13583.02
1005 1990 4 25532 44.22000 39.38000 9997.83
1005 1991 4 25728 44.44000 39.43000 10371.41
1005 1992 0 25932 44.67000 39.44000 11039.38
1005 1993 3 26461 45.28930 48.74721 10721.85
1005 1994 3 26445 45.70240 49.01115 10912.72
1005 1995 3 26337 46.00372 49.12860 10702.64
1005 1996 1 26475 46.19075 49.15203 10760.51
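
The two id-variables mentioned above (here countyid and year) define the panel structure. Below is a minimal sketch of how to inspect this structure, reusing the countymurders data loaded above; the murder rate per 10,000 inhabitants is an illustrative derived variable of my own choosing.
# Inspecting the panel structure: one row per county-year combination
length(unique(countymurders$countyid))  # number of cross-sectional units
length(unique(countymurders$year))      # number of time periods

# murders per 10,000 inhabitants (illustrative derived variable)
countymurders$murdrate10k <- with(countymurders, 10000 * murders / popul)
head(countymurders[, c("countyid", "year", "murders", "popul", "murdrate10k")])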

1.2 Causality

Definition of causal effect of x on y: \ \ x \rightarrow y

  • How does variable y change if variable x is changed while all other relevant factors are held constant?

    • Most economic questions are ceteris paribus questions

    • It is useful to describe how an experiment would have to be designed to infer the causal effect in question (see examples below)

Simply establishing a relationship – correlation – between variables is not sufficient. Correlation alone says nothing about causality!

  • The question is, whether a found effect (correlation) between x and y can be considered as causal. There are several possibilities:

    • x \rightarrow y

    • x \leftarrow y

    • x \leftrightarrows y

    • z_j \rightarrow x \text{ and } z_j \rightarrow y, \ \ldots

  • If we have controlled for enough other variables z_j, then the estimated ceteris paribus effect can often be considered to be causal (but not always, as not all variables are observable) 1

  • However, it is typically difficult to establish causality and we always need some identifying assumptions, which should be credible


1.2.1 Some Examples

“Post hoc, ergo propter hoc” fallacy 2

Figure 1.1: Does carrying an umbrella in the morning cause rainfall in the afternoon? Which of the above cases is this?

Further examples

Causal effect of fertilizer on crop yield

  • “By how much will the production of soybeans increase if one increases the amount of fertilizer applied to the ground”
  • Implicit assumption: all other factors z_j that influence crop yield such as quality of land, rainfall, presence of parasites etc. are held fixed

Experiment:

  • Choose several one-acre plots of land; randomly assign different amounts of fertilizer to the different plots; compare yields
  • Experiment works because amount of fertilizer applied is unrelated to other factors (including the original crop yield y) influencing crop yields

Measuring the return to education

  • “If a person is chosen from the population and given another year of education, by how much will his or her wage increase?”
  • Implicit assumption: all other factors z_j that influence wages such as experience, family background, intelligence etc. are held fixed

Experiment:

  • Choose a group of people; randomly assign different amounts of education to them (infeasible!); compare wage outcomes
  • Problem without random assignment: amount of education is related to other factors that influence wages (e.g., intelligence or diligence);
    this is a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem

Effect of law enforcement on city crime level

  • “If a city is randomly chosen and given ten additional police officers, by how much would its crime rate fall?”
  • Alternatively: “If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?”

Experiment:

  • Randomly assign number of police officers to a large number of cities
  • In reality, number of police officers will be determined by crime rate – simultaneous determination of crime and number of police;
    this is mainly a x \leftrightarrows y – problem

Effect of the minimum wage on unemployment

  • “By how much (if at all) will unemployment increase if the minimum wage is increased by a certain amount (holding other things fixed)?”

Experiment:

  • Government randomly chooses minimum wage each year and observes unemployment outcomes. The experiment will work because level of minimum wage is unrelated to other factors determining unemployment
  • In reality, the level of the minimum wage will depend on political and economic factors that also influence unemployment;
    mainly a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem

1.3 The Simple Regression Model

Definition of the simple linear regression model:

y = \beta_0 + \beta_1 x + u \tag{1.1}

  • Thereby

    • \ y … Dependent variable, explained variable, response variable or regressand
    • \ x … Independent variable, explanatory variable or regressor
    • \ \beta_0 … Intercept
    • \ \beta_1 … Slope parameter
    • \ u … Error term, disturbance, unobserved factors with E(u)=0, which is not restrictive because of \beta_0

This is a simple regression model, because we have only one explanatory variable.

  • Equation 1.1 describes what change in y we can expect if x changes. It follows:

    \dfrac {dE(y|x)}{dx} \ = \ \beta_1 + \dfrac {dE(u|x)}{dx} \ = \ \beta_1

    as long as \dfrac {dE(u|x)}{dx} = 0

  • Interpretation of \beta_1: By how much does the dependent variable change (on average, as u always varies in some way) if the independent variable is increased by one unit?

    • This interpretation is only correct if all other things (contained in u) remain (on average) constant when the independent variable x is increased by one unit!

Remark: The simple linear regression model is rarely applicable in practice but its discussion is useful for pedagogical reasons

  • Using a simple regression model we usually have a \ (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem rendering the causal interpretation of \beta_1 incorrect in most cases

1.3.1 Some Examples


  • A simple wage equation: wage = \beta_0 + \beta_1 educ + u
    • \beta_1 measures the change in hourly wage given another year of education, holding all other factors fixed
    • u represents labor force experience, tenure with current employer, work ethic, intelligence, etc.
  • Soybean yield and fertilizer: yield = \beta_0 + \beta_1 fertilizer + u
    • \beta_1 measures the effect of fertilizer on yield, holding all other factors fixed
    • u represents unobserved (or omitted) factors like Rainfall, land quality, presence of parasites, etc.

1.3.2 Conditional mean independence assumption

When is a causal interpretation of Equation 1.1 justified?

  • Conditional mean independence assumption

E(u \, | \, x) = E(u) = 0 \tag{1.2}

  • The explanatory variable must not contain any information about the mean of the unobserved factors in u

    • So knowing something about x doesn't give us information about u

    • This leads to \frac {dE(u \mid x)}{dx}=0 as required. If this assumption is satisfied, we actually have a (x \rightarrow y) – case

  • Regarding the wage example wage = \beta_0 + \beta_1 educ + u, ability is likely an important but often unobserved determinant of the wage of a particular individual. As ability is not an explicit variable in the model, it is contained within u

    • The conditional mean independence assumption is unlikely to hold in this case because individuals with more education will also be more capable on average. Knowing something about the education (variable x) of a particular individual therefore gives us some information about the ability of that individual (which is in u)

      • Hence, E(u \, | \, x) \neq 0 is easily possible in this case

      • Basically, we have the (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being ability

  • Regarding the fertilizer example, a similar argument holds. Typically, a farmer uses more fertilizer if the quality of the soil is bad. Therefore, the quality of the soil, which is part of u, influences both crop yield and the amount of fertilizer used. Hence, we once again have a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being the quality of the soil

    • And furthermore, E(u \, | \, x) \neq 0, as the amount of fertilizer used (variable x) gives us information about the quality of the soil, which is part of u \; \Rightarrow \; the conditional mean independence assumption is probably violated in this case (see the simulation sketch below)
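
The following small simulation illustrates why such a violation matters. It is only a sketch with made-up numbers: an unobserved factor z (think of ability or soil quality) influences both x and y, so the simple regression of y on x does not recover the true causal coefficient of 0.5.
# Simulation of the (z_j -> x and z_j -> y) problem
set.seed(123)
n <- 10000
z <- rnorm(n)               # unobserved factor (e.g., ability or soil quality)
x <- 0.8*z + rnorm(n)       # x depends on z
u <- z + rnorm(n)           # z is also part of the error term u
y <- 2 + 0.5*x + u          # true causal effect of x on y is 0.5

coef(lm(y ~ x))             # slope estimate is biased upwards, far from 0.5
cor(x, u)                   # x and u are correlated, so E(u|x) != 0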

1.3.3 Population regression function (PRF)

Taking the conditional expectation of Equation 1.1, we arrive at the so-called population (true) regression function

E(y \, | \, x) \ = \ E(\beta_0 + \beta_1 x + u \, | \, x) \ = \ \beta_0 + \beta_1 x + \underbrace {E(u \, | \, x)}_{= \, 0} \tag{1.3}

Because of Equation 1.2, this implies

E(y \, | \, x) \ = \ \beta_0 + \beta_1 x \tag{1.4}

  • This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable and Equation 1.4 is, in a certain sense, the best possible predictor of y, given the information x and assumption Equation 1.2

  • Furthermore, \beta_1 = \dfrac {dE(y|x)}{dx}. That means that a one-unit increase in x changes the conditional expected value (the average) of y by the amount of \beta_1 (if the conditional mean independence assumption is met)

  • For a given value of x, the distribution of y is centered around E(y|x), as illustrated in Figure 1.2, which shows a graphical representation of the population regression function

Figure 1.2: Population regression line; Source: Wooldridge (2019)

1.3.4 Estimation

  • In order to estimate the regression model one needs data, i.e., a random sample of n observations (y_i, x_i), \ i=1, \ldots , n

  • The task is to fit a regression line through the data points as well as possible; this fitted line is an estimate of the PRF:

\hat y_i = \hat \beta_0 + \hat \beta_1 x_i \tag{1.5}

  • The following Figure 1.3 gives an illustration of this problem

Figure 1.3: Estimated regression line; Source: Wooldridge (2019)

Principle of ordinary least squares – OLS

What does “as good as possible” mean?

  • We define the regression residuals \hat u_i as (note, a hat, “^”, always denotes an estimated value)

\hat u_i \ \equiv \ y_i - \hat y_i \ = \ y_i - \underbrace {(\hat \beta_0 + \hat \beta_1 x_i)}_{\hat y_i} \tag{1.6}

  • We choose \hat \beta_0 and \hat \beta_1 so as to minimize the sum of squared regression residuals

\underset {\hat \beta_0, \hat \beta_1} {\operatorname {min}} \ \sum_{i=1}^n \hat u_i^2 \ \ \rightarrow \ \ \hat \beta_0, \, \hat \beta_1 \tag{1.7}

  • The resulting first order conditions are

\dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =

\quad \quad \quad \sum_{i=1}^n -2 (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset {!}{=} 0

\dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =

\quad \quad \quad \quad \sum_{i=1}^n - 2x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset {!}{=} 0


From these first order conditions we immediately arrive at the so-called normal equations, which are two linear equations in the two unknowns \hat \beta_0 and \hat \beta_1

\sum_{i=1}^n (\underbrace {y_i - \hat \beta_0 - \hat \beta_1 x_i}_{\hat u_i})= 0 \tag{1.8}

\sum_{i=1}^n x_i ( {y_i - \hat \beta_0 - \hat \beta_1 x_i}) = 0 \tag{1.9}

Dividing the first normal equation, Equation 1.8, by n gives

\frac {1}{n} \sum_{i=1}^n y_i - \hat \beta_0 - \hat \beta_1 \frac {1}{n}\sum_{i=1}^n x_i = 0

  • This implies

\bar y = \hat \beta_0 + \hat \beta_1 \bar x \ \ \Rightarrow \ \ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \tag{1.10}


To calculate the slope parameter \hat \beta_1, we insert Equation 1.10 into the second normal equation, Equation 1.9

\sum_{i=1}^n x_i (y_i - \underbrace {(\bar y - \hat \beta_1 \bar x)}_{\hat \beta_0} - \hat \beta_1 x_i) = 0

  • Dividing by n and expanding the sum leads to

\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \frac {1}{n} \sum_{i=1}^n x_i + \hat \beta_1 \bar x \frac {1}{n} \sum_{i=1}^n x_i - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0 \ \ \Rightarrow

\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \bar x + \hat \beta_1 \bar x^2 - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0

  • Collecting terms and applying the shift theorem for variances (“Steinerscher Verschiebungssatz”) we get

\frac {1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y) - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = 0

  • This immediately leads to the OLS formula for the slope parameter

\hat \beta_1 \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \tag{1.11}

This equals the sample covariance of y and x divided by the sample variance of x

Formula Equation 1.11 is only defined if there is some variation in the explanatory variable x, i.e., the sample variance of x must not be zero

After having calculated \hat \beta_1 by the formula in Equation 1.11 we get \hat \beta_0 by inserting \hat \beta_1 into formula Equation 1.10
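
These formulas are easy to apply directly in R. The following is a small sketch using the wage1 data from Section 1.1.3 and a regression of wage on educ (my own choice of variables for illustration). The normalizing factors in the numerator and denominator of Equation 1.11 cancel, so the built-in sample covariance and variance functions can be used as they are.
# OLS "by hand": slope from Equation 1.11, intercept from Equation 1.10
library(wooldridge)
data(wage1)

b1 <- with(wage1, cov(educ, wage) / var(educ))   # Equation 1.11
b0 <- with(wage1, mean(wage) - b1 * mean(educ))  # Equation 1.10
c(b0 = b0, b1 = b1)

# comparison with the built-in OLS routine
coef(lm(wage ~ educ, data = wage1))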


Algebraic properties of OLS

The first normal equation, Equation 1.8, implies:

  1. The regression line always passes through the sample midpoint (\bar x, \bar y), according to Equation 1.10

  2. The sum (and average) of the residuals is zero: \sum_{i=1}^n \hat u_i = 0 according to Equation 1.8 and the definition in Equation 1.6

Furthermore, the second normal equation, Equation 1.9, together with the definition of the residuals, Equation 1.6, implies:

  1. The regressor x_i and the regression residuals \hat u_i are orthogonal:
    \sum_{i=1}^n x_i \hat u_i=0, i.e., they are uncorrelated

This is the extremely important orthogonality property of OLS


Estimation by Methods of Moments

Another approach for estimating the (true) population parameters \beta_0 and \beta_1 is the method of moments procedure, MoM

  • The basis for this is the conditional mean independence assumption, Equation 1.2, E(u \, | \, x) = E(u) = 0. This implies that the covariance between u and x is zero:

\operatorname {Cov}(x,u) \ = \ E \left[ (x-E(x)) \, (u-0) \right] \ =

E(x \, u) - E(x) \underbrace {E(u)}_0 \ = \ E(x \, u) \quad \Rightarrow

E(x \, u) = E_x [x \underbrace {E(u | x)}_0 ] = 0

  • Hence, we have two (population) moment restrictions

E(u) \ = \ E(\underbrace {y-\beta_0-\beta_1 x}_u) = 0 \tag{1.12}

E(x \, u) \ = \ E[x \, (y-\beta_0-\beta_1 x)]=0 \tag{1.13}


The method of moments approach to estimate the parameters imposes these two population moments restrictions on the sample data

  • In particular: the population moments are replaced by their sample counterparts

  • The justification is as follows: By the Law of Large Numbers, LLN, the sample moments converge to their population/theoretical counterparts under rather weak assumptions (stationarity, weak dependence). E.g., with increasing sample size n, the sample mean of a random variable converges to the expectation of this random variable (compare Theorem A.2)

  • So we can estimate the population moments by the corresponding empirical moments. In particular, we estimate the expectation, E(y), with the arithmetic sample mean \bar y, knowing that by the LLN this sample estimator converges to E(y) with increasing sample size

  • Hence, the population moment conditions, Equation 1.12 and Equation 1.13, can be replaced (estimated) by their corresponding sample means:

\frac {1}{n} \sum_{i=1}^n (y_i-\hat \beta_0-\hat \beta_1 x_i)=0

\frac {1}{n} \sum_{i=1}^n x_i \, (y_i-\hat \beta_0-\hat \beta_1 x_i)=0


However, the above conditions (which the parameters \beta_0 and \beta_1 have to meet) are exactly the same as the first order conditions from minimizing the sum of squared residuals, the normal equations, Equation 1.8 and Equation 1.9, and therefore yield the same solutions.

  • Hence, OLS and MoM estimation yield the very same estimated parameters \hat \beta_0 and \hat \beta_1 in this case. (For an additional analysis of MoM estimation, see Section 2.4.1)

  • Furthermore, the OLS estimator is also equal to the maximum likelihood estimator, ML, assuming normally distributed error terms

    • Maximum likelihood estimation is treated in more detail in Section 10.2. Intuitively, ML means that – for a given sample – the estimated parameters are chosen such that the probability of obtaining the respective sample is maximized
  • Under standard assumptions, OLS, MoM and ML estimators are equivalent (but generally, they can be different!)
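
As a small sketch of the MoM idea (again using the wage–education example, my own choice for illustration), the two sample moment conditions form a system of two linear equations in \hat \beta_0 and \hat \beta_1 that can be solved directly and indeed reproduces the OLS estimates.
# Method of moments: solving the two sample moment conditions directly
library(wooldridge)
data(wage1)

x <- wage1$educ
y <- wage1$wage

# the conditions  mean(y - b0 - b1*x) = 0  and  mean(x*(y - b0 - b1*x)) = 0
# written as the linear system A %*% c(b0, b1) = b
A <- rbind(c(1,       mean(x)),
           c(mean(x), mean(x^2)))
b <- c(mean(y), mean(x * y))
solve(A, b)                          # MoM estimates of (beta0, beta1)

coef(lm(wage ~ educ, data = wage1))  # identical to the OLS estimates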


1.3.5 An example in R

  • Install R from https://www.r-project.org

  • Install RStudio from https://rstudio.com/products/rstudio/download/#download

  • Start RStudio and install the packages AER and wooldridge (which we will need very often). For that purpose go to the lower right window, choose the tab Packages, then the tab Install and enter AER and then click Install. If you are asked during the installation whether you want to compile code, type: no (in the lower left window). Repeat the same for the package wooldridge

  • To input code, use the upper left window. To execute code, mark the code in the upper left window and click on the Run button at the top of the upper left window

  • You will see the results in the lower left window

  • To run the examples from these slides, simply copy the code from the slides (shaded in grey) into the upper left window, mark it and run it


We want to investigate to what extent the success in an election is determined by the expenditures during the campaign.
# We use a data set contained in the "wooldridge" package 

# We already installed this package, however, if we want to use it in R,  
# we additionally have to load it with the library() command
library(wooldridge)

# Loading the data set "vote1" from the wooldridge package with the "data" command
data(vote1)

# printing out the first 6 observations of the data set "vote1" with the command "head()"
head(vote1)
        state district democA voteA expendA expendB prtystrA lexpendA lexpendB
      1    AL        7      1    68 328.296   8.737       41 5.793916 2.167567
      2    AK        1      0    62 626.377 402.477       60 6.439952 5.997638
      3    AZ        2      1    73  99.607   3.065       55 4.601233 1.120048
      4    AZ        3      0    69 319.690  26.281       64 5.767352 3.268846
      5    AR        3      0    75 159.221  60.054       66 5.070293 4.095244
      6    AR        4      1    69 570.155  21.393       46 6.345908 3.063064
          shareA
      1 97.40767
      2 60.88104
      3 97.01476
      4 92.40370
      5 72.61247
      6 96.38355

Plotting the percentage of votes for candidate A versus the share of campaign expenditures from A.
plot(voteA ~ shareA, data=vote1)


Running a regression of voteA on shareA with the command lm() (for linear model)
out <- lm(voteA ~ shareA, data=vote1)
# We stored the results in a list with the freely chosen name "out"

# With coef(out) we print out the estimated coefficients 
# Try to interpret the estimated coefficients 
coef(out)
      (Intercept)      shareA 
       26.8122141   0.4638269
# With fitted(out) we store the fitted values
yhat <- fitted(out)

# With residuals(out) we store the residuals
uhat <- residuals(out)
Checking the orthogonality property of OLS – the correlation between the explanatory variable x and the residuals \hat u.
round( cor(uhat, vote1$shareA), digits = 14) 
      [1] 0
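
We can also check the other algebraic properties of OLS from Section 1.3.4 numerically; they hold up to floating point error (a small sketch reusing out and uhat from above).
# Sum (and average) of the residuals is numerically zero
round( sum(uhat), digits = 8)

# Regressor and residuals are orthogonal
round( sum(vote1$shareA * uhat), digits = 8)

# Regression line passes through the sample midpoint (x-bar, y-bar)
c( mean(vote1$voteA), coef(out)[1] + coef(out)[2] * mean(vote1$shareA) )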

# Previous plot plus estimated regression line.
plot(voteA ~ shareA, data=vote1)
abline(out)


Plotting residuals. These should show no systematic pattern.
plot(uhat)
abline(0,0)


Plotting predicted values versus actual values of voteA. Are predictions biased?
plot(yhat ~ voteA, data=vote1)

# 45° line
abline(0,1)


Plotting squared residuals versus fitted values. Useful for detecting a varying variance (heteroscedasticity)
plot(uhat^2 ~ yhat, data=vote1)


Discussion of example

This simple model for the success in an election seems very plausible; however, it suffers from a very common problem

  • In this particular example, the conditional mean independence assumption is almost certainly violated. Why?

    • Because the campaign expenditures strongly depend on donations from supporters. The stronger a candidate is in a particular district, the more donations he or she will receive and the higher the potential campaign expenditures will be

    • Hence, we have a reverse causality problem here, \ x \leftrightarrows y, or a third-variable problem z_j \rightarrow x \text{ and } z_j \rightarrow y, both of which lead to E(u|x) \neq 0 in general

    • This will probably lead to a strong overestimation of the effect of campaign expenditures on votes in this particular case

  • Note that although x is very likely correlated with unobserved factors in u, the example above showed that the correlation between x and the sample residuals \hat u is zero – orthogonality property of OLS. Hence, this fact says nothing about whether the conditional mean independence assumption is satisfied or not

  • A possible remedy: Multiple regression model (with variables z as additional variables included in the set of explanatory variables) or trying to identify the x \rightarrow y relationship with external information (like instrumental variables; we will deal with this approach in Chapter 7)


1.3.6 Measures of Goodness-of-Fit

How well does the explanatory variable explain the dependent variable?

Measures of Variation

SST = \sum\nolimits_{i=1}^n (y_i - \bar y)^2, \quad SSE = \sum\nolimits_{i=1}^n (\hat y_i - \bar y)^2, \quad SSR =\sum\nolimits_{i=1}^n \hat u_i^2

  • SST is total sum of squares, represents total variation in the dependent variable

  • SSE is explained sum of squares, represents variation explained by regression

  • SSR is residual sum of squares, represents variation not explained by regression

Decomposition of total variation (because of y_i = \hat y_i + \hat u_i, \sum_i x_i \hat u_i=0 and \sum_i \hat u_i=0)

SST = SSE + SSR \tag{1.14}

Goodness-of-Fit measure

R^2 \ \equiv \ \dfrac {SSE}{SST}\ = \ 1 - \dfrac {SSR}{SST} \tag{1.15}

The R-squared measures the fraction of the total variation in y that is explained by the regression


Example

# Running once more the regression of voteA on shareA with the command lm() 
out <- lm(voteA ~ shareA, data=vote1)

# Printing a summary of the regression
summary(out)
      
      Call:
      lm(formula = voteA ~ shareA, data = vote1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -16.8919  -4.0660  -0.1682   3.4965  29.9772 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 26.81221    0.88721   30.22   <2e-16 ***
      shareA       0.46383    0.01454   31.90   <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 6.385 on 171 degrees of freedom
      Multiple R-squared:  0.8561,  Adjusted R-squared:  0.8553 
      F-statistic:  1018 on 1 and 171 DF,  p-value: < 2.2e-16
# Caution: A high R-squared does not mean that the regression has a causal interpretation!
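
The sums of squares behind this R-squared can also be computed by hand (a small sketch, reusing yhat and uhat from Section 1.3.5).
# Decomposition of total variation and R-squared by hand
SST <- sum( (vote1$voteA - mean(vote1$voteA))^2 )
SSE <- sum( (yhat - mean(vote1$voteA))^2 )
SSR <- sum( uhat^2 )

c(SST = SST, SSE_plus_SSR = SSE + SSR)  # decomposition, Equation 1.14
c(R2 = SSE/SST, R2_alt = 1 - SSR/SST)   # Equation 1.15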

1.3.7 Statistical Properties of OLS

  • The OLS parameter estimates (estimated coefficients) are functions of random variables and thus random variables themselves

  • We are interested in the moments and the distribution of the estimated coefficients, especially in the expectations and variances

  • Three questions are of particular interest:

    • Are the OLS estimates unbiased, i.e., E(\hat \beta_i) = \beta_i \, ?

    • How precise are our parameter estimates, i.e., how large is their variance \operatorname {Var}(\hat \beta_i) \; ?

    • How are the estimated OLS coefficients distributed?


Unbiasedness of OLS

Theorem 1.1 (Unbiasedness of OLS) Given a random sample and conditional mean independence of u_i from x we state:

E(\hat \beta_0)=\beta_0, \ \ E(\hat \beta_1)=\beta_1

From Equation 1.11 we have

\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ \tag{1.16}

We substitute for y_i = \beta_0 + \beta_1 x_i + u_i

\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (\beta_0 + \beta_1 x_i + u_i)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ =

\beta_0 \underbrace{ \left[ \dfrac { \frac{1}{n} {\sum_{i=1}^n (x_i - \bar x)} }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_0 + \beta_1 \underbrace{ \left[ \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) x_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ \ \Rightarrow

\hat \beta_1 \, = \, \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i } { \underbrace{ \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }_{s_x^2} } \tag{1.17}

Taking the conditional expectation, considering the conditional mean independence assumption

E(\hat \beta_1 | x_1, \ldots, x_n ) \ = \ \beta_1 + \dfrac {1}{s_x^2} \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) \underbrace {E [ u_i | x_1, \ldots, x_n ]}_0 \ = \ \beta_1 \ \ \text{ and }

E(\hat \beta_1) = E_x [E(\hat \beta_1 | x_1, \ldots, x_n )] = E_x(\beta_1)=\beta_1

by the law of iterated expectations

Interpretation of unbiasedness

  • The estimated coefficients may be smaller or larger than the true values, depending on the sample which is the result of a random draw

  • However, on average, they will be equal to the true value (on average means with regard to repeated samples)

  • In a given sample, estimates may differ considerably from true values


Variances of the OLS estimates

  • Depending on the sample, the estimates will be nearer or farther away from the true values

  • How far can we expect our estimates to be away from the true population values on average? (=sampling variability or sampling errors)

  • Sampling variability is measured by the estimators’ variances

  • We need an additional assumption to easily calculate these variances:

Homoscedasticity of u_i

\operatorname{Var}(u_i| x_1, \ldots, x_n) = \sigma^2 \tag{1.18}

  • The values of the explanatory variable must not contain any information about the variability of the unobserved factors

  • Together with the conditional mean independence assumption this furthermore implies that the conditional variance of u is also equal to the unconditional variance of u

\operatorname{Var}(u) = E_x [ \underbrace{ E ( u^{2} | x)-[ E(u | x)]^{2} }_{ \operatorname{Var}(u_i| x_1, \ldots, x_n) = \sigma^2 } ] = E_x [ E ( u^{2} | x ) ] = E_x(\sigma^2) = \sigma^2

  • The square root of \sigma^2 is \sigma, the standard deviation of the error

  • Example: y = f(x) + u

Figure 1.4: Homoscedastic errors; Source: Wooldridge 2020

  • Example: wage = f(education) + u

Figure 1.5: Heteroscedastic errors; Source: Wooldridge 2020

Theorem 1.2 (Variance of OLS estimators) Under random sampling, conditional mean independence of u_i from x and homoscedasticity we have

\operatorname{Var}(\widehat{\beta}_{1} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}=\dfrac{\sigma^{2}}{S S T_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} \tag{1.19}

\operatorname{Var} (\widehat{\beta}_{0} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2} \frac {1}{n} \sum_{i=1}^{n} x_{i}^{2}}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}} = \dfrac{\sigma^{2} \, \bar {x^{2}}} {SST_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} \, \bar {x^{2}} \tag{1.20}

From the proof of Theorem 1.1, we use Equation 1.17

\hat \beta_1 \ = \ \ \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{ {s_x^2} }

Hence, according to Equation 1.18 and random sampling we have

\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n) \ = \ \dfrac { \frac{1}{n^2} \sum_{i=1}^n (x_i - \bar x)^2 \operatorname {Var} ( u_i |x_1, \ldots, x_n )}{ { (s_x^2)^2 } } \ = \ \dfrac {\sigma^2}{SST_x} \ = \ \frac {1}{n} \dfrac {\sigma^2}{s_x^2}

For the unconditional variance (which is rarely used) we have

\operatorname{Var}(\hat \beta_1) \ \equiv \ E \left[ \left( \hat \beta_1-E(\hat \beta_1) \right)^2 \right] \ = \ E_x \! \left[ \underbrace { E \left( (\hat \beta_1 - \beta_1)^2 \, | \, x_1, \ldots, x_n \right) }_{\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n)} \right] \\ = \ E_x \! \left[ \dfrac {\sigma^2}{n \, s_x^2} \right] \ = \ \dfrac {1}{n} E_x \! \left[ \dfrac {\sigma^2}{s_x^2} \right]

The sampling variability of the estimated regression coefficients will be lower,

  • the smaller the variability of the unobserved factors \sigma^2

  • the higher the variation in the explanatory variable s_x^2

    • In particular, the ratio of \sigma / s_x is crucial
  • the larger the sample size n


Estimating the variance of error term

  • According to our homoscedasticity assumption the variance of the error term u is independent of the explanatory variables

\operatorname {Var}(u \, | \, x) = \sigma^2 = \operatorname {Var}(u)

  • However, \sigma^2 is usually unknown, so we need an estimator for this parameter

  • A natural procedure is to use the variance of the sample residuals (note, \bar {\hat u}_i = 0, which is an OLS property, see Section 1.3.4.2, #2)

\hat \sigma^2 = \dfrac {1}{n-2} \sum_{i=1}^n (\hat u_i - \bar {\hat u}_i)^2 \ = \ \dfrac {1}{n-2} \sum_{i=1}^n \hat u_i^2 \tag{1.21}

\text{and} \quad S.E. \ \equiv \ \hat \sigma \ = \ \sqrt{\hat \sigma^2} \tag{1.22}

  • This estimator turns out to be unbiased under our assumptions (see Theorem 2.1)

  • Note that we divide by (n-2) and not by n to calculate the average above. The reason is that to calculate the \hat u_i, we first need to estimate the two parameters \beta_0 and \beta_1. This means that, knowing these two estimated parameters, only (n-2) of the residuals are informative – from the two estimated parameters together with (n-2) residuals we could infer the remaining two via the normal equations. Therefore, these last two carry no additional information

    • The number (n-2), which is the number of observations minus the number of estimated model parameters, is referred to as degrees of freedom

Standard errors for regression coefficients

Having an estimate for \sigma^2 and the standard error S.E., we are able to estimate the standard errors of the parameter estimates

Calculation of standard errors for regression coefficients

Using formulas Equation 1.19 and Equation 1.22 we arrive at

se(\hat \beta_1) \ = \ \sqrt{\widehat {\operatorname {Var}}(\hat \beta_1 | x_1, \ldots , x_n)} \ = \ \sqrt{\dfrac {\hat \sigma^2}{SST_x} } \tag{1.23}

  • The estimated standard deviations of the regression coefficients are called standard errors. They measure how precisely the regression coefficients are estimated
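
For the voting example, these quantities can be computed by hand (a sketch reusing out and uhat from Section 1.3.5) and compared with the "Residual standard error" and the standard error of shareA in the summary output of Section 1.3.6.
# Estimating sigma^2, the S.E. of the regression and se(beta1-hat) by hand
n      <- length(uhat)
sigma2 <- sum(uhat^2) / (n - 2)   # Equation 1.21
sqrt(sigma2)                      # Equation 1.22, compare "Residual standard error"

SSTx <- sum( (vote1$shareA - mean(vote1$shareA))^2 )
sqrt(sigma2 / SSTx)               # Equation 1.23, compare the standard error of shareA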

The following figures illustrate the theoretical concepts discussed above.
## Monte Carlo simulation for regressions with one explanatory variable 

##################### definition of function #########################################
sims <- function(n=120, rep=5000, sigx=1, sig=1) {
  
  set.seed(13468) # seed for random number generator
  
  # true parameters
  B0 = 0
  B1 = 0.5

  OLS  <- vector(mode = "list", length = rep)  # initializing list for storing regression results
  OLS1 <- vector(mode = "list", length = rep)  # initializing list for results with the smaller error variance
  
  ######################### rep loop #################################################
  for (i in (1:rep)) {
    x  =  rnorm(n, mean = 0, sd = sigx)
    u  =  rnorm(n, mean = 0, sd = sig)   
    u1 =  u/2         

    maxx = max(x)
    minx = min(x)
      
    y  = B0 + B1*x + u
    y1 = B0 + B1*x + u1
    
    maxy = max(y)
    miny  = min(y)
      
    OLS[[i]]  =  lm(y ~ x, model = FALSE)
    OLS1[[i]] =  lm(y1 ~ x, model = FALSE)
  }
  ########################## end rep loop ############################################
  
  
  ######################### drawing plots ############################################
  # scatterplot with true and estimated reg-line for last regression
  plot(y ~ x, col="blue")
  abline(OLS[[i]], col="blue")
  abline(c(B0,B1), col="red")
  
  # rep > 100: histogram of estimated parameter b1
  if (rep > 100) {
    b1_distribution <- sapply(OLS, function(x) coef(x)[2])
    hist(b1_distribution, breaks = 30, main="") 
    abline(v=B1, col = "red")
  }

  # true and up to 100 estimated reg-lines 
  plot(NULL, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1), ylab="y", xlab="x")
  for ( i in 1:min(100, rep) ) abline(OLS[[i]], col="lightgrey")
  points(y ~ x, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1))
  abline(c(B0,B1), col="red")
  
  # true and up to 100 estimated reg-lines, smaller sig
  plot(NULL, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1), ylab="y1", xlab="x")
  for ( i in 1:min(100, rep) ) abline(OLS1[[i]], col="lightgrey")
  points(y1 ~ x, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1))
  abline(c(B0,B1), col="red") 
  
}
######################### end of function ############################################


## Calling function `sims()` with default values for parameters
sims()

(a) Population regression function (red) and estimated regression function of a particular sample (blue), 120 observations, compare Figure 1.2 and Figure 1.3

(b) Unbiasedness: Histogram of 5000 estimates of \beta_1 based on random draws of u and x with 120 observations each. True value of \beta_1 is 0.5

(c) Variance of \hat \beta_1: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey

(d) Variance of \hat \beta_1 with smaller {\sigma} / {\sigma_x}: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey. Variance of \hat \beta_1 is much smaller

Figure 1.6: Population regression function (PRF), estimated regression functions, unbiasedness and variance of estimates


1.3.8 Example once more

We repeat the regression output from our voting example. Look for the new concepts we just discussed in the regression output shown below.

library(modelsummary)

# Running once more the regression of voteA on shareA with the command lm() 
out <- lm(voteA ~ shareA, data=vote1)

modelsummary(list("Vote for candidate A"=out), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Vote for candidate A
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   26.812***  0.887  30.221  <0.001  25.061  28.564
shareA    0.464***  0.015  31.901  <0.001   0.435   0.493
Num.Obs.  173      
R2    0.856   
RMSE    6.35    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

  1. If every variable z_j that influences both x and y is known and observable, the x \leftrightarrows y case reduces to a z_j \rightarrow x \text{ and } z_j \rightarrow y – problem.↩︎

  2. Latin: “after this, therefore because of this.”↩︎